Mini-challenge 3


Qian Ziwei https://example.com/norajones
07-11-2021

1. Overview


1.1 Background

In the island country of Kronos, increasingly noxious effects on health and farming have been linked to the uncontrolled activity of GAStech, a natural gas operator, supported by corrupt government officials. On January 20th, 2014, a corporate meeting is held to celebrate the company's new-found fortune following its initial public offering. However, a series of strange events occurs that leads to the disappearance of several employees. The Protectors of Kronos (POK), a social movement organization that has been fighting water contamination and government corruption, is suspected in the disappearance.

As analysts, we were assigned several tasks to identify risks and determine how they could have been mitigated more effectively.

1.2 Literature review

1.3 Objective

We use data and visual analytics to evaluate the changing levels of risk to the public and to recommend actions for first responders:

1.4 Question 1

1.5 Question 2

1.6 Question 3

1.7 Question 4


2. Building the visualization


2.1 Setting up the environment/packages

First, we run this first line of code to clear the environment and remove any existing R objects.

rm(list = ls())

This code chunk checks whether the required packages are installed. If a package is not installed, install.packages() installs it; library() then loads it into the current working environment.

packages = c('readr','tidytext','data.table','lubridate','ggplot2',
             'caret','dplyr','tidyr','scales','quanteda','textdata',
             'stringr','stringi','reshape2','RColorBrewer','wordcloud',
             'forcats','igraph','ggraph','widyr','clock','knitr','tidyverse',
             'DT','hms','ggiraph','topicmodels','raster','sf','maptools',
             'rgdal','ggmap','sp','mapview','tmap','gifski')
for(p in packages){
  if(!require(p,character.only = TRUE)){
    install.packages(p)
  }
  library(p,character.only = TRUE)
}

2.2 Importing data and changing data type

First, use read_csv() to import the three CSV files.

read1 <- read_csv("F:/visual/assignment and project/MC3/MC3/csv-1700-1830.csv",
                  col_types = list(col_character(),col_character(),col_character(),
                                   col_character(),col_double(),col_double(),
                                   col_character()))
read2 <- read_csv("F:/visual/assignment and project/MC3/MC3/csv-1831-2000.csv",
                  col_types = list(col_character(),col_character(),col_character(),
                                   col_character(),col_double(),col_double(),
                                   col_character()))
read3 <- read_csv("F:/visual/assignment and project/MC3/MC3/csv-2001-2131.csv",
                  col_types = list(col_character(),col_character(),col_character(),
                                   col_character(),col_double(),col_double(),
                                   col_character()))

Use rbind() to combine the three data frames, which share the same format.

df <- rbind.data.frame(read1,read2,read3)
glimpse(df)
Rows: 4,063
Columns: 7
$ type                   <chr> "mbdata", "mbdata", "mbdata", "mbdata~
$ `date(yyyyMMddHHmmss)` <chr> "20140123170000", "20140123170000", "~
$ author                 <chr> "POK", "maha_Homeland", "Viktor-E", "~
$ message                <chr> "Follow us @POK-Kronos", "Don't miss ~
$ latitude               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ longitude              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ location               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "~
DT::datatable(df,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = TRUE, columnDefs = list(list(width = '400px', targets = c(4))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf'))) 

In the table above, date(yyyyMMddHHmmss) is not in a time format, so we convert it to a date-time field. Because all activities in Mini-Challenge 3 occur on the same day, we also extract the time (hms) component without the date.

df$`date(yyyyMMddHHmmss)` <- date_time_parse(df$`date(yyyyMMddHHmmss)`,
                                 zone = "",
                                 format = "%Y%m%d%H%M%S")
df$time <- as_hms(ymd_hms((df$`date(yyyyMMddHHmmss)`)))
glimpse(df)
Rows: 4,063
Columns: 8
$ type                   <chr> "mbdata", "mbdata", "mbdata", "mbdata~
$ `date(yyyyMMddHHmmss)` <dttm> 2014-01-23 17:00:00, 2014-01-23 17:0~
$ author                 <chr> "POK", "maha_Homeland", "Viktor-E", "~
$ message                <chr> "Follow us @POK-Kronos", "Don't miss ~
$ latitude               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ longitude              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ location               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "~
$ time                   <time> 17:00:00, 17:00:00, 17:00:00, 17:00:~
DT::datatable(df,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = TRUE, columnDefs = list(list(width = '400px', targets = c(4))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
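As a quick sanity check on the format string, a single raw timestamp can be parsed on its own (the "UTC" zone here is used only for the example):

```r
library(clock)

# Parse one sample value from date(yyyyMMddHHmmss)
date_time_parse("20140123170000", zone = "UTC", format = "%Y%m%d%H%M%S")
```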

2.3 Question 1

2.3.1 Data processing

For Question 1, we extract the required columns from df into a new data frame. After reading all the data carefully, the vocabulary of mbdata and ccdata is very different, so we process the two types separately. Since ccdata consists of police and fire department records, that subset is labeled meaningful directly.

df1 <- subset(df, select = c("type","author","message"))
df_m <- subset(df1, type == "mbdata")
df_cc <- subset(df1,type == "ccdata")
df_cc$condition <- "meaningful"
DT::datatable(df_cc,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = TRUE, columnDefs = list(list(width = '400px', targets = c(3))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
2.3.1.1 Junk message

Junk definition: after reading all the data carefully, I selected the following authors and message patterns (#Grammar, retweets marked RT) as junk.

# Breaking a pattern string across lines embeds the newline and indentation
# in the literal, so the alternatives after the first line never match.
# Build the pattern from a vector instead.
junk_authors <- c("KronosQuoth","Clevvah4Eva","choconibbs","trollingsnark",
                  "blueSunshine","whiteprotein","FriendsOfKronos","junkman377",
                  "junkman995","redisrad","roger_roger","cheapgoods998",
                  "rockinHW","panopticon","dels4realz","eazymoney","cleaningFish")
junk <- df_m %>%
  filter(str_detect(author, paste(junk_authors, collapse = "|")) |
           str_detect(message, "#Grammar|RT"))
2.3.1.2 Meaningful message

Meaningful definition: after reading all the data carefully, I selected the following authors and message keywords as meaningful.

meaningful_authors <- c("POK","AbilaPost","CentralBulletin","ourcountryyourrights",
                        "MindOfKronos","Viktor-E","maha_Homeland","anaregents",
                        "wordWatcher","InternationalNews","HomelandIlluminations",
                        "NewsOnlineToday","AbilaPoliceDepartment","KronosStar",
                        "magaMan","Sara_Nespola","protoGuy","SiaradSea","AbilaFire",
                        "footfingers","truthforcadau","truccotrucco","dangermice",
                        "trapanitweets","sofitees","brewvebeenserved","hennyhenhendrix")
meaningful <- df_m %>%
  filter(str_detect(author, paste(meaningful_authors, collapse = "|")) |
           str_detect(message,[2191 chars quoted with '"']))
2.3.1.3 Meaningless message

Meaningless definition: messages that are neither meaningful nor junk.

This group is obtained by first removing the overlap between the meaningful and junk groups, then subtracting both from df_m with the anti_join() function.

meaningful <- dplyr::anti_join(meaningful,junk,by = c("type", "author", "message"))

combinedata <- rbind.data.frame(meaningful, junk)

meaningless <- dplyr::anti_join(df_m,combinedata,by = c("type", "author", "message"))
2.3.1.4 Combining the meaningful, meaningless and junk messages

Combine the meaningful, meaningless and junk data and add a new label column.

junk$condition <- "junk"
meaningful$condition <- "meaningful"
meaningless$condition <- "meaningless"

finalq1 <- rbind.data.frame(meaningful,junk, meaningless)
DT::datatable(finalq1,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = TRUE, columnDefs = list(list(width = '400px', targets = c(3))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
2.3.1.5 Cleaning the dataset before tokenization

Use the stringr package to remove punctuation (which includes @, # and <) and Chinese characters from the messages. The messages in ccdata are already very clean and do not require special handling.

finalq1$message <- str_replace_all(finalq1$message,'[[:punct:]]+', "")

finalq1$message <- str_replace_all(finalq1$message,fixed("@"),"")

finalq1$message <- str_replace_all(finalq1$message,fixed("#"),"")

finalq1$message <- str_replace_all(finalq1$message,fixed("<"),"")

finalq1$message <- str_replace_all(finalq1$message,"[\u4e00-\u9fa5]+", "")
The messages in the table below are clean.
DT::datatable(finalq1,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = TRUE, columnDefs = list(list(width = '400px', targets = c(3))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
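The same cleaning steps recur for every event subset in the later sections; as a sketch, they could be wrapped in a small helper (clean_message is a hypothetical name, and the POSIX punctuation class already covers @, # and <):

```r
library(stringr)

# Hypothetical helper bundling the recurring stringr cleaning steps
clean_message <- function(x) {
  x <- str_replace_all(x, "[[:punct:]]+", "")  # punctuation, including @, # and <
  str_replace_all(x, "[\u4e00-\u9fa5]+", "")   # Chinese characters
}
```

It would then replace each block of repeated calls, e.g. `finalq1$message <- clean_message(finalq1$message)`.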
2.3.1.6 Tokenizing the data and applying custom stop words
tidy_m <- finalq1 %>%
  unnest_tokens(word, message)%>%
  count(condition,word,sort = TRUE)

data(stop_words)
tidy_m <- tidy_m %>%
  anti_join(stop_words)

my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
                                "yay","ya","xx3942","wuz","wow",
                                "dr"))
tidy_m <- tidy_m %>%
  anti_join(my_stopwords)

tidy_cc <- df_cc %>%
  unnest_tokens(word,message) %>%
  count(word, sort = TRUE)

DT::datatable(tidy_m,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = FALSE, columnDefs = list(list(width = '60px', targets = c(0:3))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
DT::datatable(tidy_cc,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = FALSE, columnDefs = list(list(width = '60px', targets = c(0:2))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))

2.3.2 Simple EDA

The graph below shows the distribution of word counts for the junk, meaningful and meaningless groups. The junk group has a high repetition rate of words.

ggplot(tidy_m,aes(n,fill = condition))+
  geom_histogram(show.legend = FALSE)+
  xlim(0,100)+
  facet_wrap(~condition, ncol = 2,scales = "free_y")

The graph below shows the top 15 words (by count n) for the junk, meaningful and meaningless groups.

tidy_m %>%
  group_by(condition) %>%
  slice_max(n, n= 15) %>%
  ungroup() %>%
  mutate(word = reorder(word,n)) %>%
  ggplot(aes(x = n,
             y= word,
             fill = condition))+
  geom_col(show.legend = FALSE)+
  facet_wrap(~ condition, ncol = 2,scales = "free")

The graph of ccdata shows that events such as “fire” and “traffic” dominate the records.

tidy_cc %>%
  slice_max(n,n = 15) %>%
  ggplot(aes(x = n,
             y= reorder(word,n)))+
  geom_col(show.legend = FALSE)+
  labs(y = NULL)

2.3.3 Visualizing the ccdata and mbdata

2.3.3.1 Wordcloud

The conclusions are similar to those of the simple EDA.


wordcloud_m <- finalq1 %>%
  filter(condition == "meaningful") %>%
  unnest_tokens(word, message)%>%
  anti_join(stop_words) %>%
  anti_join(my_stopwords) %>%
  count(word,sort = TRUE)%>%
  with(wordcloud(word,n,max.words = 100))
wordcloud_m <- finalq1 %>%
  filter(condition == "meaningless") %>%
  unnest_tokens(word, message)%>%
  anti_join(stop_words) %>%
  anti_join(my_stopwords) %>%
  count(word,sort = TRUE)%>%
  with(wordcloud(word,n,max.words = 100))
wordcloud_m <- finalq1 %>%
  filter(condition == "junk") %>%
  unnest_tokens(word, message)%>%
  count(word,sort = TRUE)%>%
  anti_join(stop_words) %>%
  anti_join(my_stopwords) %>%
  with(wordcloud(word,n,max.words = 100))
tidy_cc %>%
  with(wordcloud(word,n,max.words = 100))

2.3.3.2 tf-idf visualization
m_tf_idf <- tidy_m %>%
  bind_tf_idf(word,condition,n)

m_tf_idf %>%
  arrange(desc(tf_idf))
# A tibble: 4,111 x 6
   condition   word                           n      tf   idf  tf_idf
   <chr>       <chr>                      <int>   <dbl> <dbl>   <dbl>
 1 junk        rt                          1000 0.0558  1.10  0.0613 
 2 junk        kronosstar                   884 0.0493  0.405 0.0200 
 3 junk        homelandilluminations        183 0.0102  1.10  0.0112 
 4 junk        grammar                      157 0.00876 1.10  0.00962
 5 junk        abilapost                    330 0.0184  0.405 0.00746
 6 junk        rally                        260 0.0145  0.405 0.00588
 7 meaningless cards                          9 0.00521 1.10  0.00572
 8 meaningless easycreditkronosmorecredit     9 0.00521 1.10  0.00572
 9 meaningless nobanks                        9 0.00521 1.10  0.00572
10 meaningful  abilapost                     70 0.0137  0.405 0.00555
# ... with 4,101 more rows
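As a check on the first row above: idf is the natural log of (number of groups / number of groups containing the word), and “rt” appears only in the junk group, so:

```r
idf_rt <- log(3 / 1)      # 3 condition groups; "rt" occurs in 1 of them
tf_rt  <- 0.0558          # tf for "rt" from the table above
round(tf_rt * idf_rt, 4)  # matches the tf_idf column (0.0613)
```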
m_tf_idf %>%
  group_by(condition) %>%
  slice_max(tf_idf, n = 15) %>%
  ungroup() %>%
  ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = condition))+
  geom_col(show.legend = FALSE)+
  facet_wrap(~condition,ncol = 2,scales = "free")+
  labs(x = "word",y = NULL)

2.3.3.3 Bigrams visualization
meaningful_bigrams <- meaningful %>%
  unnest_tokens(bigram,message,token = "ngrams", n = 2)
meaningful_bigrams %>%
  count(bigram, sort = TRUE)
# A tibble: 5,775 x 2
   bigram                    n
   <chr>                 <int>
 1 viktor e                 48
 2 of the                   42
 3 dancing dolphin          41
 4 in the                   40
 5 at the                   38
 6 abila centralbulletin    30
 7 pok rally                28
 8 to the                   24
 9 dr newman                23
10 dolphin fire             20
# ... with 5,765 more rows
meaningful_separated <- meaningful_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")
# my_stopwords is a tibble, so filter against its word column
meaningful_filtered <- meaningful_separated %>%
  filter(!word1 %in% my_stopwords$word) %>%
  filter(!word2 %in% my_stopwords$word) %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
meaningful_counts <- meaningful_filtered %>% 
  count(word1, word2, sort = TRUE)
meaningful_graph <- meaningful_counts %>%
  filter(n > 4) %>%
  graph_from_data_frame()

set.seed(2020)
a <- grid::arrow(type = "closed",length = unit(.15,"inches"))

ggraph(meaningful_graph,layout = "fr")+
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()
junk_bigrams <- junk %>%
  unnest_tokens(bigram,message,token = "ngrams", n = 2)
junk_bigrams %>%
  count(bigram, sort = TRUE)
# A tibble: 7,338 x 2
   bigram                       n
   <chr>                    <int>
 1 pokrally hi                670
 2 kronosstar pokrally        598
 3 pok rally                  234
 4 rt homelandilluminations   183
 5 rt abilapost               169
 6 rally grammar              157
 7 rt kronosstar              143
 8 if you                     126
 9 of the                     115
10 you can                    102
# ... with 7,328 more rows
junk_separated <- junk_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")
junk_filtered <- junk_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
junk_counts <- junk_filtered %>% 
  count(word1, word2, sort = TRUE)
junk_graph <- junk_counts %>%
  filter(n > 50) %>%
  graph_from_data_frame()
set.seed(2020)
a <- grid::arrow(type = "closed",length = unit(.15,"inches"))

ggraph(junk_graph,layout = "fr")+
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()
meaningless_bigrams <- meaningless %>%
  unnest_tokens(bigram,message,token = "ngrams", n = 2)
meaningless_bigrams %>%
  count(bigram, sort = TRUE)
# A tibble: 2,051 x 2
   bigram                           n
   <chr>                        <int>
 1 badprofiles.kronos tacky        12
 2 of the                          12
 3 viktor e                        10
 4 abila nobanks                    9
 5 abila pictures                   9
 6 cards get                        9
 7 credit cards                     9
 8 easy credit                      9
 9 easycredit.kronos morecredit     9
10 get what                         9
# ... with 2,041 more rows
meaningless_separated <- meaningless_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")
meaningless_filtered <- meaningless_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
meaningless_counts <- meaningless_filtered %>% 
  count(word1, word2, sort = TRUE)
meaningless_graph <- meaningless_counts %>%
  filter(n > 4) %>%
  graph_from_data_frame()
set.seed(2020)
a <- grid::arrow(type = "closed",length = unit(.15,"inches"))

ggraph(meaningless_graph,layout = "fr")+
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

2.3.3.4 Correlation visualization
m_words_cor <- tidy_m %>%
  group_by(condition) %>%
  filter(n()>=20) %>%
  pairwise_cor(word,condition,sort = TRUE)
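A sketch of how these correlations could be plotted, in the same style as the bigram graphs above (it assumes the igraph/ggraph packages loaded earlier; the 0.8 cutoff is an arbitrary choice):

```r
set.seed(2020)
# Keep only strongly correlated word pairs and draw them as a network
m_words_cor %>%
  filter(correlation > .8) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()
```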

2.4 Question 2

2.4.1 Data processing

Repeating the process above, we obtain the whole meaningful dataset, this time keeping all columns.

q2_m <- subset(df, type == "mbdata")
q2_cc <- subset(df,type == "ccdata")
q2_junk <- q2_m %>%
  filter(str_detect(author, paste(c("KronosQuoth","Clevvah4Eva","choconibbs",
                                    "trollingsnark","blueSunshine","whiteprotein",
                                    "FriendsOfKronos","junkman377","junkman995",
                                    "redisrad","roger_roger","cheapgoods998",
                                    "rockinHW","panopticon","dels4realz",
                                    "eazymoney","cleaningFish"),
                                  collapse = "|")) |
           str_detect(message, "#Grammar|RT"))


q2_meaningful <- q2_m %>%
  filter(str_detect(author, paste(c("POK","AbilaPost","CentralBulletin",
                                    "ourcountryyourrights","MindOfKronos",
                                    "Viktor-E","maha_Homeland","anaregents",
                                    "wordWatcher","InternationalNews",
                                    "HomelandIlluminations","NewsOnlineToday",
                                    "AbilaPoliceDepartment","KronosStar","magaMan",
                                    "Sara_Nespola","protoGuy","SiaradSea",
                                    "AbilaFire","footfingers","truthforcadau",
                                    "truccotrucco","dangermice","trapanitweets",
                                    "sofitees","brewvebeenserved","hennyhenhendrix"),
                                  collapse = "|")) |
           str_detect(message,[2191 chars quoted with '"']))


q2_meaningful <- dplyr::anti_join(q2_meaningful,q2_junk)
2.4.1.2 Dividing the three events and processing the data separately
POK event

Keywords: pokrally, Abila City Park, Stand Up Speak Up, Sylvia Marek, Dr. Audrey McConnell Newman, Professor Lorenzo Di Stefano, Lucio Jakab and Viktor-E.

q2_rally_m <- q2_meaningful %>%
  filter(str_detect(message,"pokrally|Abila City Park|Stand Up Speak Up|Sylvia Marek|Audrey McConnell Newman, Professor Lorenzo Di Stefano|Lucio Jakab|Viktor-E|Sylvia|Marek|Newman|Stefano|Di Stefano|Lucio|Jakab"))

q2_rally_cc <- q2_cc %>%
  filter(str_detect(message,"ABILA CITY PARK|CROWD"))

q2_rally <- rbind(q2_rally_m,q2_rally_cc)

DT::datatable(q2_rally,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = TRUE, columnDefs = list(list(width = '400px', targets = c(4))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
q2_rally$message <- str_replace_all(q2_rally$message,'[[:punct:]]+', "")

q2_rally$message <- str_replace_all(q2_rally$message,fixed("@"),"")

q2_rally$message <- str_replace_all(q2_rally$message,fixed("#"),"")

q2_rally$message <- str_replace_all(q2_rally$message,fixed("<"),"")

q2_rally$message <- str_replace_all(q2_rally$message,"[\u4e00-\u9fa5]+", "")
q2_rally_tidy <- q2_rally %>%
  unnest_tokens(word, message)

data(stop_words)
q2_rally_tidy <- q2_rally_tidy %>%
  anti_join(stop_words)

my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
                                "yay","ya","xx3942","wuz","wow",
                                "dr"))
q2_rally_tidy <- q2_rally_tidy %>%
  anti_join(my_stopwords)

DT::datatable(q2_rally_tidy,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = FALSE, columnDefs = list(list(width = '60px', targets = c(0:3))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
Fire in Dancing Dolphin
q2_fire_m <- q2_meaningful %>%
  filter(str_detect(message,"fire|dolphin|dancing|building|apartment|Madeg|dispatch|afd|floor|floors|fireman|firefighters|firefighter|evacuate|evacuated|evacuating|evacuation|trapped|injuries|scene|trapped|collapsed|blaze|escalated"))

q2_fire_cc <- q2_cc %>%
  filter(str_detect(message,"Fire|Crime|scene"))

q2_fire <- rbind(q2_fire_m,q2_fire_cc)

DT::datatable(q2_fire,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = TRUE, columnDefs = list(list(width = '400px', targets = c(4))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
q2_fire$message <- str_replace_all(q2_fire$message,'[[:punct:]]+', "")

q2_fire$message <- str_replace_all(q2_fire$message,fixed("@"),"")

q2_fire$message <- str_replace_all(q2_fire$message,fixed("#"),"")

q2_fire$message <- str_replace_all(q2_fire$message,fixed("<"),"")

q2_fire$message <- str_replace_all(q2_fire$message,"[\u4e00-\u9fa5]+", "")
q2_fire_tidy <- q2_fire %>%
  unnest_tokens(word, message)

data(stop_words)
q2_fire_tidy <- q2_fire_tidy %>%
  anti_join(stop_words)

my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
                                "yay","ya","xx3942","wuz","wow",
                                "dr"))
q2_fire_tidy <- q2_fire_tidy %>%
  anti_join(my_stopwords)

DT::datatable(q2_fire_tidy,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = FALSE, columnDefs = list(list(width = '60px', targets = c(0:3))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
From hit-and-run accident to shooting and standoff
q2_accident_m <- q2_meaningful %>%
  filter(str_detect(message,"shooting|stanoff|hostage|swat|negotiation|fight|arrest|hit|van| driver|bicyclist|accident|incident|bike|L829|pursuit|gun|shot|kill|dead|yelling|screaming|negotiatingnegotiator|caught|over|end|shoot|shot|chasing"))

q2_accident_cc <- q2_cc %>%
  filter(str_detect(message,"van|pursuit|accident|vandalism|swat"))

q2_accident <- rbind(q2_accident_m,q2_accident_cc)

DT::datatable(q2_accident,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = TRUE, columnDefs = list(list(width = '400px', targets = c(4))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
q2_accident$message <- str_replace_all(q2_accident$message,'[[:punct:]]+', "")

q2_accident$message <- str_replace_all(q2_accident$message,fixed("@"),"")

q2_accident$message <- str_replace_all(q2_accident$message,fixed("#"),"")

q2_accident$message <- str_replace_all(q2_accident$message,fixed("<"),"")

q2_accident$message <- str_replace_all(q2_accident$message,"[\u4e00-\u9fa5]+", "")
q2_accident_tidy <- q2_accident %>%
  unnest_tokens(word, message)

data(stop_words)
q2_accident_tidy <- q2_accident_tidy %>%
  anti_join(stop_words)

my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
                                "yay","ya","xx3942","wuz","wow",
                                "dr"))
q2_accident_tidy <- q2_accident_tidy %>%
  anti_join(my_stopwords)

DT::datatable(q2_accident_tidy,filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = FALSE, columnDefs = list(list(width = '60px', targets = c(0:3))),
                             dom='Bfrtip',
                             buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))

2.4.2 Visualizing the different events

2.4.2.1 POK event
q2_rally_tidy%>%
  filter(str_detect(word,"rally")) %>%
  ggplot(aes(x = time)) +
  geom_histogram()+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
    labs(y ="rally")
q2_rally_tidy%>%
  filter(str_detect(word,"pok")) %>%
  ggplot(aes(x = time)) +
  geom_histogram()+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
    labs(y ="pok")
q2_rally_tidy%>%
  filter(str_detect(word,"sylvia|marek")) %>%
  ggplot(aes(x = time)) +
  geom_histogram()+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
    labs(y ="sylvia|marek")
q2_rally_tidy%>%
  filter(str_detect(word,"audrey|mcConnell|newman")) %>%
  ggplot(aes(x = time)) +
  geom_histogram()+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
  labs(y ="audrey|mcConnell|newman")
q2_rally_tidy%>%
  filter(str_detect(word,"lucio|jakab")) %>%
  ggplot(aes(x = time)) +
  geom_histogram()+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
  labs(y ="audrey|mcConnell|newman")
q2_rally_tidy%>%
  filter(str_detect(word,"lorenzo|di|stefano")) %>%
  ggplot(aes(x = time)) +
  geom_histogram()+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
  labs(y ="lorenzo|di|stefano")
q2_rally_tidy%>%
  filter(str_detect(word,"viktor")) %>%
  ggplot(aes(x = time)) +
  geom_histogram()+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
  labs(y ="viktor")

2.4.2.2 Fire in Dancing Dolphin
q2_fire_tidy%>%
  filter(str_detect(word,"fire|dolphin|dancing|building|apartment")) %>%
  ggplot(aes(x = time)) +
  geom_histogram()+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
    labs(y = NULL)
q2_fire_tidy%>%
  filter(str_detect(word,"floor|floors|upper||resident|trapped")) %>%
  ggplot(aes(x = time)) +
  geom_histogram()+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
    labs(y =NULL)
q2_fire_tidy%>%
  filter(str_detect(word,"afd|police|cop|cops|fireman|firefighter")) %>%
  ggplot(aes(x = time)) +
  geom_histogram()+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
    labs(y =NULL)
q2_fire_tidy%>%
  filter(str_detect(word,"ambulance|injury|injuries|evacuated|evacuating|evacuation|evacuate")) %>%
  ggplot(aes(x = time)) +
  geom_histogram()+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
  labs(y =NULL)
q2_fire_tidy%>%
  filter(str_detect(word,"control")) %>%
  ggplot(aes(x = time)) +
  geom_histogram()+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
  labs(y = "control")
q2_fire_tidy%>%
  filter(str_detect(word,"collapsed|blaze|escalated|explosion")) %>%
  ggplot(aes(x = time)) +
  geom_histogram()+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
  labs(y =NULL)

2.4.2.3 From hit-and-run accident to shooting and standoff
q2_accident_tidy%>%
  filter(str_detect(word,"hit|run|van|bicyclist|driver|incident|accident|bike|pursuit")) %>%
  ggplot(aes(x = time)) +
  geom_histogram()+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
    labs(y = NULL)
q2_accident_tidy%>%
  filter(str_detect(word,"gun|shoot|shot|histage")) %>%
  ggplot(aes(x = time)) +
  geom_histogram()+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
    labs(y =NULL)
q2_accident_tidy%>%
  filter(str_detect(word,"killed||dead")) %>%
  ggplot(aes(x = time)) +
  geom_histogram()+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
    labs(y =NULL)
q2_accident_tidy%>%
  filter(str_detect(word,"standoff|negotiating|negotiate|negotiator|negotiation|")) %>%
  ggplot(aes(x = time)) +
  geom_histogram()+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
  labs(y =NULL)
q2_accident_tidy%>%
  filter(str_detect(word,"yelling|screaming|chasing|standoff")) %>%
  ggplot(aes(x = time)) +
  geom_histogram()+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
  labs(y = "control")
q2_accident_tidy%>%
  filter(str_detect(word,"end|over|caught|rescued")) %>%
  ggplot(aes(x = time)) +
  geom_histogram()+
  theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
  labs(y =NULL)

2.5 Question 3

2.5.1 Data processing

q3_lda <- subset(q2_meaningful,select = c("type","date(yyyyMMddHHmmss)","author",
                                 "message","latitude","longitude","time"))
q3_lda <- na.omit(q3_lda)

tidy_q3 <- q3_lda %>%
  unnest_tokens(word,message)

q3_wordcount <- tidy_q3 %>%
  anti_join(stop_words)

my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
                                "yay","ya","xx3942","wuz","wow",
                                "dr"))
q3_wordcount <- q3_wordcount %>%
  anti_join(my_stopwords) %>%
  count(author,word,sort = TRUE)
q3_wordcount
# A tibble: 453 x 3
   author        word             n
   <chr>         <chr>        <int>
 1 footfingers   pok             21
 2 truccotrucco  standoff        21
 3 truccotrucco  im              12
 4 truccotrucco  shooting        12
 5 truccotrucco  gelatogalore     9
 6 truccotrucco  van              9
 7 truthforcadau viktor           8
 8 footfingers   kronos           7
 9 dangermice    abilafire        6
10 footfingers   people           6
# ... with 443 more rows
q3_dtm <- q3_wordcount %>%
  cast_dfm(author,word,n)
bgmap <- raster("F:/visual/assignment and project/MC3/MC3/Geospatial/MC2-tourist.tif")

abala_st <- st_read(dsn = "F:/visual/assignment and project/MC3/MC3/Geospatial",
                    layer = "Abila")
Reading layer `Abila' from data source 
  `F:\visual\assignment and project\MC3\MC3\Geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
Geodetic CRS:  WGS 84
q3_gps <- subset(q2_meaningful,select = c("type","date(yyyyMMddHHmmss)","author",
                                 "message","latitude","longitude","time"))

gps_m <- na.omit(q3_gps)
gps_sf <- st_as_sf(gps_m,
                   coords = c("longitude","latitude"),
                   crs = 4326)
gps_point <- gps_sf %>%
  group_by(message) %>%
  summarize(m = time,
            do_union = FALSE) %>%
  st_cast("MULTIPOINT")

q3_fire_gps <- subset(q2_fire_m,select = c("type","date(yyyyMMddHHmmss)","author",
                                 "message","latitude","longitude","time"))
gps_fire_m <- na.omit(q3_fire_gps)
gps_fire_sf <- st_as_sf(gps_fire_m,
                   coords = c("longitude","latitude"),
                   crs = 4326)
gps_fire_sf
Simple feature collection with 15 features and 5 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 24.856 ymin: 36.059 xmax: 24.894 ymax: 36.059
Geodetic CRS:  WGS 84
# A tibble: 15 x 6
   type   `date(yyyyMMddHHmms~ author  message                time    
 * <chr>  <dttm>               <chr>   <chr>                  <time>  
 1 mbdata 2014-01-23 18:42:10  danger~ "i think the dancin d~ 18:42:10
 2 mbdata 2014-01-23 18:55:01  danger~ "Police are flooding ~ 18:55:01
 3 mbdata 2014-01-23 18:55:05  danger~ "Oh man they are evac~ 18:55:05
 4 mbdata 2014-01-23 19:00:00  danger~ "Someone just got res~ 19:00:00
 5 mbdata 2014-01-23 19:16:07  danger~ "Wow - more police #a~ 19:16:07
 6 mbdata 2014-01-23 19:20:00  danger~ "More fire trucks are~ 19:20:00
 7 mbdata 2014-01-23 19:22:00  danger~ "They are announcing ~ 19:22:00
 8 mbdata 2014-01-23 19:27:00  danger~ "They are getting rea~ 19:27:00
 9 mbdata 2014-01-23 19:41:33  trucco~ "why didnt i stay at ~ 19:41:33
10 mbdata 2014-01-23 19:45:11  trucco~ "i think were trapped~ 19:45:11
11 mbdata 2014-01-23 20:00:00  danger~ "They just pulled a f~ 20:00:00
12 mbdata 2014-01-23 20:05:00  danger~ "They just put him in~ 20:05:00
13 mbdata 2014-01-23 20:05:00  sofite~ "#APD is evacuating u~ 20:05:00
14 mbdata 2014-01-23 20:34:00  trucco~ "arent swat suppposed~ 20:34:00
15 mbdata 2014-01-23 21:30:00  danger~ "OMG was that an expl~ 21:30:00
# ... with 1 more variable: geometry <POINT [°]>
gps_fire_point <- gps_fire_sf %>%
  group_by(message) %>%
  summarize(m = `time`,
            do_union = FALSE) %>%
  st_cast("MULTIPOINT")
tmap_mode("view")
tm_shape(bgmap)+
  tm_rgb(r=1,g=2,b=3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(gps_point) +
  tm_dots(col = "red") 
tmap_mode("view")
tm_shape(bgmap)+
  tm_rgb(r=1,g=2,b=3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(gps_fire_point) +
  tm_dots(col = "red")
q3_author_lda <- LDA(q3_dtm,k = 3, control = list(seed = 1234))

q3_topics <- tidy(q3_author_lda,matrix = "beta")

q3_topics
# A tibble: 1,092 x 3
   topic term         beta
   <int> <chr>       <dbl>
 1     1 pok      1.37e- 1
 2     2 pok      2.64e-82
 3     3 pok      1.84e- 2
 4     1 standoff 1.25e-81
 5     2 standoff 2.41e- 4
 6     3 standoff 7.63e- 2
 7     1 im       3.95e-82
 8     2 im       5.06e- 5
 9     3 im       4.28e- 2
10     1 shooting 1.24e-87
# ... with 1,082 more rows
q3_topics %>%
  group_by(topic) %>%
  top_n(10,beta) %>%
  ungroup() %>%
  arrange(topic,-beta) %>%
  ggplot(aes(beta,term,fill = topic))+
  geom_col(show.legend = FALSE)+
  facet_wrap(~topic,ncol = 2,scales = "free")
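Above, k = 3 topics was fixed by hand. As a rough sketch, topicmodels::perplexity() could compare a few candidate values of k on the same document-term matrix, following the document's own LDA call (the candidate set c(2, 3, 4, 5) is an arbitrary choice):

```r
# Fit an LDA model per candidate k and record its perplexity
ks <- c(2, 3, 4, 5)
perp <- sapply(ks, function(k) {
  lda_k <- LDA(q3_dtm, k = k, control = list(seed = 1234))
  perplexity(lda_k)   # lower perplexity suggests a better fit
})
data.frame(k = ks, perplexity = perp)
```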